Acronyms


Application specific integrated circuit (ASIC) 
Block random access memory (BRAM) 
Block Triple Modular Redundancy (BTMR) 
Clock (CLK or CLKB) 

Combinatorial logic (CL) 

Computer aided design (CAD) 

Configurable Logic Block (CLB) 
Configuration cross section ($P_{configuration}$)
Digital Signal Processing Block (DSP) 
Distributed triple modular redundancy (DTMR) 
Dual interlocked cell (DICE) 

Dual redundancy (DR) 

Edge-triggered flip-flops (DFFs) 
Equivalence Checking (EC) 

Error detection and correction (EDAC) 

Field programmable gate array (FPGA) 
Finite state machine (FSM) 

Flip-flop SEU cross section ($P_{DFFSEU}$)
Functional logic cross section ($P_{functionalLogic}$)
Gate Level Netlist (EDF, EDIF, GLN) 

Global triple modular redundancy (GTMR) 
Hardware Description Language (HDL) 
Input/output (I/O)

Linear energy transfer (LET) 

Local triple modular redundancy (LTMR) 
Look up table (LUT) 

Mean Fluence to failure (MFTF) 

Mean Time to Failure (MTTF) 


NASA Electronic Parts and Packaging (NEPP) 
Negative doped with electrons (N+)
Operational frequency ($f_s$)

Power on reset (POR) 

Place and Route (PR) 

Positive doped with holes (P+)

Radiation Effects and Analysis Group (REAG) 
Single event functional interrupt (SEFI) 
Single event functional interrupt cross section ($P_{SEFI}$)
Single event effects (SEEs) 

Single event latch-up (SEL) 

Single event transient (SET) 

Single event transient cross section ($P_{SET \rightarrow SEU}$)
Single event upset (SEU) 

Single event upset cross-section ($\sigma_{SEU}$)

Static random access memory (SRAM) 
System cross section ($P(f_s)_{error}$)

System on a chip (SOC)

Time delay ($t_{dly}$)

Temporal redundancy (TR) 

Total ionizing dose (TID) 

Voltage connected to positive rail ($V_{DD}$)
Voltage connected to ground rail ($V_{SS}$)
Windowed shift register (WSR) 
Agenda


Single Event Upsets (SEUs) in Digital Devices. 
Single Event Upsets and FPGA Configuration. 
Single Event Upsets in FPGA Data Paths. 
Fail-Safe Strategies for Critical Applications. 
Dual Redundancy: 

— Lockstep and 

— Separate systems. 
Cold Sparing. 
Triple modular redundancy (TMR): 

— Block TMR (BTMR), 

— Local TMR (LTMR), 

— Distributed TMR (DTMR), and 

— Global TMR (GTMR). 
Fail-Safe State Machines. 
SEUs versus Total Ionizing Dose (TID)

• The two are commonly confused.
— TID is dose that can cause device failure from exposure to ionizing particles (mostly protons and electrons) over time.
— SETs and SEUs have nothing to do with dose over time.
• An SET or SEU results from one particle's passage through a sensitive region of a device.
• The passage causes ionization and can cause a transistor to change its state.
How SEUs Affect FPGAs

• SEU and SET error signatures vary between FPGA devices:
— Temporary glitch (transient),
— Change of state (incorrect state machine transitions),
— Global upsets: loss of clock or unexpected reset,
— Route breakage (no signal can get through), and
— Configuration corruption.

• The question is how to avoid system failure, and the answer depends on the following:
— The system's requirements and the definition of failure,
— The susceptibility of the target device and its surrounding circuitry,
— Implemented fail-safe strategies,
— Reliable design practices,
— Radiation environment, and
— Trade space and decided risk.
FPGA SEU Categorization as Defined by NASA Goddard REAG:

Probabilities are with respect to fluence (SEU cross sections, $\sigma_{SEU}$).

[Figure: FPGA SEU categories, including sequential and combinatorial logic (CL) in the data path, and global routes and hidden logic.]

SEU testing is required in order to characterize the $\sigma_{SEU}$s for each of the FPGA categories.
Preliminary Design Considerations for Mitigation and Trade Space

Determine the most susceptible components:
$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

• Does the designer need to add mitigation?
• Will there be compromises?
— Performance and speed,
— Power,
— Schedule,
— Mitigating the susceptible components?
— Reliability (working and mitigating as expected)?

The impacts on speed, power, area, reliability, and schedule are important questions to ask.
Programmable Switch Implementation and SEU Susceptibility

[Figure: programmable switch implementations. Antifuse (one time programmable), shown as an electron micrograph of an antifuse. SRAM (reprogrammable), shown as a programming bit with read/write data access.]
Configuration SEU Test Results and the REAG FPGA SEU Model

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

[Table: REAG FPGA model by configuration type. Surviving row labels: Flash, Hardened SRAM.]
What Does The Last Slide Mean?

FPGA susceptibility by configuration type:

Antifuse: Configuration has been designated as hard regarding SEEs. Susceptibilities only exist in the data paths and global routes. However, global routes are hardened and have a low SEU susceptibility.

SRAM (non-mitigated): Configuration has been designated as the most susceptible portion of circuitry. All other upsets (except for global routes) are too statistically insignificant to take into account. E.g., it is a waste of time to study data path transients; however, clock transient studies are significant.

Flash: Configuration has been designated as hard (but NOT immune) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).

Hardened SRAM: Configuration has been designated as hardened (but NOT hard) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).
Example: Routing Configuration Upsets in a Xilinx Virtex FPGA

[Figure: look-up table (LUT) and routing matrix.]

Because multiple paths can pass through the routing matrix, this configuration upset can be catastrophic, i.e., it can break simple mitigation.
Fixing SRAM-based Configuration... Scrubbing Definition

• From SEU testing, it has been shown that the configuration memory of radiation un-hardened SRAM-based FPGAs is highly susceptible to SEUs.

• We address configuration susceptibility via scrubbing: scrubbing is the act of writing into FPGA configuration memory while the device's functional logic area is operating, with the intent of correcting configuration memory bit errors.


Configuration scrubbing only pertains to 
SRAM-based configuration devices. 
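
The scrubbing definition above can be illustrated with a minimal behavioral sketch. It assumes a golden copy of the configuration is held outside the configuration memory and models that memory as a list of frame words. The names `scrub`, `config_memory`, and `golden_image` are hypothetical and do not correspond to any vendor interface; real scrubbers differ in how they read back and rewrite frames.

```python
from typing import List


def scrub(config_memory: List[int], golden_image: List[int]) -> int:
    """Rewrite every configuration frame that differs from the golden copy.

    Returns the number of frames that were corrected. The functional logic
    is assumed to keep running while this loop executes.
    """
    corrected = 0
    for index, golden_frame in enumerate(golden_image):
        if config_memory[index] != golden_frame:
            config_memory[index] = golden_frame  # overwrite the upset frame
            corrected += 1
    return corrected


# Hypothetical 4-frame configuration with one injected bit flip.
golden = [0b1010, 0b0110, 0b1111, 0b0001]
live = list(golden)
live[2] ^= 0b0100  # model a configuration SEU in frame 2

frames_fixed = scrub(live, golden)
```

After the call, `live` matches `golden` again and `frames_fixed` is 1. The sketch says nothing about the state held in the design's flip-flops.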
Warning!

• Fixing a configuration bit does not mean that you have fixed the state in the functional logic path.

• In order to guarantee that the functional logic is in the expected state after the configuration bit is fixed, either the state must be restored or a reset must be issued.


Reliably getting to an expected state after a 
configuration-bit SEU (that affects the design’s 
functionality) requires one of the following: 


— Fix configuration bit + (reset or correct DFFs) or 
— Full reconfiguration. 
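
A small behavioral sketch of the warning above, under stated assumptions: the design is a hypothetical 2-bit counter whose next-state function lives in a LUT (configuration), while the count itself lives in DFFs. The names `GOLDEN_LUT` and `clock` are illustrative only. The sketch shows that repairing the LUT leaves the counter out of step with an unaffected reference until a reset is issued.

```python
GOLDEN_LUT = [1, 2, 3, 0]  # next-state table of a 2-bit counter


def clock(state: int, lut: list) -> int:
    """Advance the DFF state by one clock using the configured LUT."""
    return lut[state]


lut = list(GOLDEN_LUT)
state = reference = 0  # reference models the same design with no upset

# One good clock: both reach 1.
state, reference = clock(state, lut), clock(reference, GOLDEN_LUT)

lut[1] = 3  # configuration SEU corrupts one LUT entry
state, reference = clock(state, lut), clock(reference, GOLDEN_LUT)

lut = list(GOLDEN_LUT)  # configuration bit is fixed (scrubbed)
mismatch_after_scrub = state != reference  # DFF state is still wrong

state, reference = clock(state, lut), clock(reference, GOLDEN_LUT)
mismatch_one_clock_later = state != reference  # still out of step

state = reference = 0  # reset restores a known state
mismatch_after_reset = state != reference
```

In this toy design the corrected configuration alone never brings the counter back in step; only the reset does.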
Example: Routing Configuration Upsets in a Xilinx Virtex FPGA

[Figure: look-up tables (LUTs) connected through the routing matrix.]

Configuration + design state must be corrected after a configuration 
SEU hit. 
Single Event Upsets in an FPGA's Functional Data Path and Fail-Safe Strategies

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$
Data-path SEUs and Their Effect at the System Level

Each data path in an FPGA device is a cascade of sequential and combinatorial logic.

The occurrence of an SET or SEU does not 
definitively cause system error. 


Probability of a system error due to an 

SEU depends on many factors: 

— Probability of fault generation in a gate (SET or 
SEU). 


— Probability of error propagation — will the SET 
or SEU force the system’s next state to be 
incorrect? 
Error Propagation in a Data-Path: SEU De-rating

SEUs usually occur between clock edges (during system next-state calculation): a system-level malfunction occurs if the event forces the system's next state to be incorrect.

• Capacitive filtration: data-path capacitance can stop transient upset propagation; e.g.:
— Routing metal or heavy loading.
— If a transient doesn't reach a sequential element, then it most likely will not cause a system upset.
• Logic masking:
— Redundancy and mitigation of paths can stop upset propagation.
— Turned off paths from gated logic can stop upset propagation.
• Temporal delay: path delays can block temporary SEUs from disturbing next state calculation.
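
Logic masking from gated logic can be shown with a two-input AND gate. The sketch below is illustrative only; `and_gate` and `transient_reaches_output` are hypothetical helper names. A transient on one input reaches the output only when the other input is not holding the gate off.

```python
def and_gate(a: int, b: int) -> int:
    """Two-input AND gate on single-bit values."""
    return a & b


def transient_reaches_output(gate, nominal_a: int, b: int) -> bool:
    """True if flipping input a (modeling an SET) changes the gate output."""
    return gate(nominal_a, b) != gate(nominal_a ^ 1, b)


# Gating input b = 0 turns the path off: the transient is masked.
masked = not transient_reaches_output(and_gate, nominal_a=1, b=0)

# Gating input b = 1 leaves the path on: the transient propagates.
propagated = transient_reaches_output(and_gate, nominal_a=1, b=1)
```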
Fail-safe Strategies for Single Event Upsets (SEUs)

• The following slides will demonstrate commonly used mitigation strategies for FPGA devices.
• What you should learn:
— The differences between mitigation strategies.
— Strengths and weaknesses of various strategies.
— Questions to ask or considerations to make when evaluating mitigation schemes.
— Which mitigation schemes are best for various types of FPGA devices.
• The scope of this presentation will cover fail-safe strategies for configuration and data-path SEUs.
Fail-Safe Strategies for FPGA
detection-recovery mechanisms via
fail-safe strategies.
Differentiating Fail-Safe Strategies:


Detection:
— Watchdog (state or logic monitoring).
— Can range from simplistic checking to complex decoding.
— Action (alerting, correction, or recovery). 
Masking (does not mean correction): 
— Preventing error propagation to other logic. 
— Requires redundancy + mitigation or detection. 
— Turn off faulty path. 
Correction (error may not be masked): 
— Error state (memory) is changed/fixed. 
— Need feedback or new data flush cycle. 
Recovery: 
— Bring system to a deterministic state. 
— Might include correction. 
Redundancy Is Not Enough

• Simply adding redundancy to a system is not enough to assume that the system is well protected.

• Questions/concerns that must be addressed for a critical system expecting redundancy to cure all (or most):
— How is the redundancy implemented? 


— What portions of your system are protected? Does the 
protection comply with the results from radiation testing? 


— Is detection of malfunction required to switch to a redundant 
system or to recover? 


— If detection is necessary, how quickly can the detection be 
performed and responded to? 


— Is detection enough?... Does the system require correction? 


Listed are crucial concerns that should be addressed at 
design reviews and prior to design implementation 
Mitigation

• Error masking vs. error correction... there's a difference.

• Mitigation can be:
— User inserted: part of the actual design process.
— Embedded: built into the device library cells. The user does not verify the mitigation; the manufacturer does.
• Mitigation should reduce error...
— Generally through redundancy.
— Incorrect implementation can increase error.
— Overly complex mitigation cannot be verified and incurs too high of a risk to implement.
Questions to Ask:
Availability versus Correct Operation


Requirements must be satisfied. Is there a metric for mitigation trade-offs and risks regarding system requirements?


What is your expected up-time versus down-time (availability)?

Is correct operation well defined? E.g., correct 
operation can be defined as working as 
expected through error states after an SEU 
strike... however, this must be clearly stated in 
requirements. 


Is system failure well defined? 


Can availability and correct operation be 
deterministic regardless of error signature? 


Detection and Recovery

• Not all mitigation schemes require detection.

Questions/considerations (important review questions):
— If your scheme requires detection:
  • Can the system detect all types of error signatures?
  • Can the system detect all error signatures fast enough?
  • Do different errors require different recovery schemes, and can the system accommodate them?
— How are you going to verify the detection and recovery?
— How much downtime will there be during recovery?

"We know it will work" is not a good enough answer: ask how and if the scheme has been verified!
Embedded Mitigation versus User Inserted Mitigation

[Figure: Dual Interlocked Cell (DICE) and Localized Triple Modular Redundancy (LTMR).]
Radiation Hardened (per SEU) versus Commercial FPGA Devices

For this presentation, a radiation hardened (per SEU) device is a device that has embedded mitigation.

Radiation hardened FPGA devices are available to users. They make the design cycle much easier!

SEU mitigation is generally applied to the following:
— Data-path elements:
  • Localized redundancy inserted into library cell flip-flops (DFFs): Localized Triple Modular Redundancy (LTMR) or Dual Interlocked Cell (DICE).
  • SET filters inserted on the DFF data input pin.
  • SET filters inserted on the DFF clock input pin.
— Global routes.
— Memory cells.
Localized Redundancy Embedded in the DFF Cells

Warning! These figures are simplified schematics of the actual implementation.

[Figure: Dual Interlocked Cell (DICE), Xilinx; Localized Triple Modular Redundancy (LTMR), Microsemi.]



Problem! Although DFFs are protected, SETs from the 
combinatorial logic in the data path and SETs in the global 
routes can cause incorrect data to be captured by the DFF. 
Embedded Temporal Redundancy (TR): SET Filtration in The Data Path

• Temporal filter placed directly before the DFF.
• Localized scheme that reduces SET capture in the data path.
• Delays must be well controlled.
— Every delay path shall consistently have a predefined delay and must be verified.
• Do not implement TR as a user inserted mitigation scheme. Delay must be deterministic, and it is too difficult to manage with place and route tools.
• Maximum clock frequency is reduced by the amount of new delay.

[Figure: combinatorial logic data path feeding the D input of a DFF through the temporal filter.]
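
The frequency penalty of a temporal filter can be estimated with a short calculation. This sketch reads the statement about maximum clock frequency as the minimum clock period growing by the filter delay, which is an assumed timing model. The function name and the 100 MHz and 1 ns figures are hypothetical.

```python
def max_frequency_with_filter(f_max_hz: float, filter_delay_s: float) -> float:
    """Estimate the maximum clock frequency after adding a temporal filter.

    Assumes the filter delay adds directly to the minimum clock period.
    """
    period_s = 1.0 / f_max_hz
    return 1.0 / (period_s + filter_delay_s)


# Hypothetical design: 100 MHz before filtering, 1 ns of added filter delay.
f_new_hz = max_frequency_with_filter(100e6, 1e-9)
```

With these assumed numbers the 10 ns period becomes 11 ns, so the estimate is roughly 90.9 MHz.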
Embedded Radiation Hardened Global Routes: SET Filtration in The Global Route Path

• Some FPGAs contain radiation-hardened clock trees and other global routes (Microsemi products only).
• Global structures are generally hardened by using larger buffers.
• TR has also been used on the lowest leaves of the clock trees (Xilinx V5QV only).

[Figure: clock tree.]

Global route susceptibility is often overlooked. Beware, many devices do not have hardened global routes.
Radiation Hardened versus Commercial FPGA Device Geometries And Gate Count

As Geometries Get Smaller, More Gates Are Available for Mitigation
Courtesy of Syno

[Figure: bar chart of logic capacity (millions, axis 1 to 5) by device, with SEU hardened/harder devices marked. Devices listed: Virtex UltraScale+, Kintex UltraScale+, Virtex UltraScale, Kintex UltraScale, Virtex-7, Virtex-7Q, Stratix 5, Virtex 5, Virtex 5QV, Virtex 4QV and Virtex 4, RT-ProASIC, RTAX-S.]
FPGA Devices Listed by Configuration Type (Not All Are Included in The List): Susceptibility

DFF: flip-flop. DICE: dual interlocked cell.

SRAM
— Devices: Stratix, Virtex, Kintex
— Embedded mitigation: No
— Most susceptible components: Configuration

Antifuse
— Devices: RTAX, RTSXS
— Embedded mitigation: DFFs and clocks (configuration is already hardened by nature)
— Most susceptible components: Combinatorial logic (however, susceptibility is considered low)

Flash
— Devices: ProASIC3
— Embedded mitigation: Configuration is already hardened by nature.
— Most susceptible components: DFFs and clocks

Hardened SRAM
— Devices: Virtex V5QV
— Embedded mitigation: Configuration + DICE DFFs + SET filters
— Most susceptible components: Clocks. In some cases additional mitigation may be necessary for configuration and DFFs.

Go to http://radhome.gsfc.nasa.gov, manufacturer
websites, and other space agency sites for more
information on SEU data and total ionizing dose data.
User Inserted Mitigation:
Dual Redundancy, Cold Sparing, and 
Triple Modular Redundancy (TMR) 
Dual Redundancy

[Figure: two complex systems feed a compare block, followed by alert, mask, and recover.]


Dual redundant systems cannot correct (roll-back is an 
exception); they can only detect. 


“Compare and Alert” systems must be highly reliable 
and verifiable. 


Generally not all I/O can be monitored or compared. 


Best used for data calculation and manipulation... 
easiest to place compares on data buses. 
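
A minimal sketch of a "compare and alert" stage on a data bus, assuming two redundant copies produce one bus word each. `CompareResult` and `compare_and_alert` are hypothetical names. The sketch illustrates why a dual redundant pair can detect and mask but cannot correct: on a mismatch there is no way to tell which copy is wrong.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompareResult:
    alert: bool
    output: Optional[int]  # None means the output is masked (withheld)


def compare_and_alert(out_a: int, out_b: int) -> CompareResult:
    """Compare the bus words from two redundant systems."""
    if out_a == out_b:
        return CompareResult(alert=False, output=out_a)
    # Mismatch: raise an alert and mask the output. With only two copies
    # there is no majority, so the faulty copy cannot be identified.
    return CompareResult(alert=True, output=None)


agree = compare_and_alert(0x5A, 0x5A)
disagree = compare_and_alert(0x5A, 0x5B)
```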
Lockstep Dual Redundancy

[Figure: compare block followed by alert, mask, and recover.]

• For lockstep, the dual complex systems are exact duplicates.

• Synchronization is necessary. It is challenging and sometimes unpredictable. If not well managed, availability is affected.


— Cache misses. 


— Care must be taken to synchronize asynchronous inputs (e.g., 
interrupts). 

— Complex algorithms for pipelining or memory management must 
be controlled for data to be exact duplicates and in lockstep. 
Lockstep Compare

[Figure: compare block followed by alert, mask, and recover.]

• Best to use a "data valid" signal to indicate data are ready for compare.

• System performance can be affected because of data synchronization at the comparator.

• Halt control will be necessary in case of "out-of-sync" data.
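
The compare behavior described above can be sketched as a small decision function, assuming each system presents a data word together with a "data valid" flag and that a bounded wait is tolerated before the pair is declared out of sync. `Status`, `lockstep_compare`, and the wait limit are hypothetical choices for illustration, not part of the original material.

```python
from enum import Enum


class Status(Enum):
    WAIT = "wait"
    MATCH = "match"
    MISCOMPARE = "miscompare"
    HALT_OUT_OF_SYNC = "halt_out_of_sync"


def lockstep_compare(valid_a: bool, data_a: int,
                     valid_b: bool, data_b: int,
                     cycles_waiting: int, max_wait_cycles: int) -> Status:
    """Compare only when both sides flag valid data; halt if one side lags too long."""
    if valid_a and valid_b:
        return Status.MATCH if data_a == data_b else Status.MISCOMPARE
    if (valid_a or valid_b) and cycles_waiting >= max_wait_cycles:
        return Status.HALT_OUT_OF_SYNC  # one side has waited too long
    return Status.WAIT


both_ready = lockstep_compare(True, 7, True, 7, 0, 4)
one_lagging = lockstep_compare(True, 7, False, 0, 1, 4)
out_of_sync = lockstep_compare(True, 7, False, 0, 4, 4)
```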

> Should the comparator be hardened? 
Lockstep Alert, Mask, and Recover

[Figure: two complex systems feed a compare block, followed by alert, mask, and recover.]

• Masking must be instantaneous.

• Where will the alert go upon a miscompare?
— Internal or external device (watchdog).

• What will be the system response to an alert?
— Full reset versus partial resets.
— FPGA (full reconfigure).
— Power cycle.
— Roll-back (correction) is an option. However, it is complex and unreliable.

Usually a system reset or power cycle is required upon SEU.

To be presented by Melanie D. Berg at the Single Event Effects (SEE) Symposium and Military and Aerospace Programmable Logic Devices (MAPLD) Workshop, La Jolla, CA, May 22-25, 2017. 3 8 


Two Separate Systems Dual Redundancy

[Figure: separate complex systems feeding a synchronize and compare stage.]


Relaxes the system synchronization requirements of 
being in lockstep. 


Compares operate on valid signals and/or message 
passing. Compares are generally more complex. 


Two separate systems can be complex to manage: 
— Data reordering and cache misses must be managed. 
— System halts must be carefully managed. 


If not well managed, system performance and 
availability will be significantly affected; and SEUs will 
not be taken care of as expected. 
Cold Sparing: Elongation of System Operation

• One active system and alternate inactive systems.
• Upon active system failure, an inactive system is turned on.
• System operation is able to be elongated after failure.

However:
— Availability is not improved... there is downtime.
— Can your system afford the downtime (critical application)?
— How clean is the system switch over?
— How long is the system switch over?
— Can the system ping-pong between active and inactive systems, or is a system considered dead after failure?
  • Ping-ponging can be used for systems that have a low probability of destructive failures.
  • Ping-ponging can be complex and can affect availability.
System versus Design Mitigation

The previous slides covered system level mitigation.

System level mitigation generally has:
— Detection, masking, no correction, downtime, and recovery actions.

The following slides will discuss triple modular redundancy (TMR) techniques that can be implemented as system or design-level mitigation.

Most of the TMR techniques will incorporate masking and detection with no downtime (unless there is a single event functional interrupt (SEFI)).

Hence, TMR can improve system performance and availability, and can elongate operation time.
Mitigation: Fail-Safe Strategies That
Do Not Require Fault Detection but
Provide SEU Masking and/or
Correction:

Triple Modular Redundancy (TMR)...
best two out of three.
How To Insert TMR into A Design: FPGA User Design Flow

HDL: Hardware description language

• Specification (HDL): TMR can be written into the HDL. Generally not done because it is too difficult.
• Synthesis: the output of synthesis is a gate netlist that represents the given HDL function. TMR can be inserted during synthesis or post synthesis.
• If inserted post synthesis, the gate level netlist is replicated, ripped apart, and voters + feedback are inserted.
Local Mitigation versus Distributed or Global Mitigation

• Local mitigation:
— Only DFFs are mitigated.
— Mitigation will include masking and potential correction at the DFF.
— Used with systems where DFFs are the most susceptible component cells.
• Distributed or global mitigation:
— The full design is mitigated with masking and correction.
• Depending on the target device, the clock tree and other global routes may also need hardening.
Various TMR Schemes: Different Topologies

• Block diagram of block TMR (BTMR): a complex function containing combinatorial logic (CL) and flip-flops (DFFs) is triplicated as three black boxes; majority voters are placed at the outputs of the triplet.

• Block diagram of local TMR (LTMR): only flip-flops (DFFs) are triplicated and data-paths stay singular; voters are brought into the design and placed in front of the DFFs.

• Block diagram of distributed TMR (DTMR): the entire design is triplicated except for the global routes (e.g., clocks); voters are brought into the design and placed after the flip-flops (DFFs). DTMR masks and corrects most single event upsets (SEUs).
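
All three topologies rely on a majority voter. A minimal bitwise sketch is given here for illustration; the function name `majority_vote` is hypothetical, and each argument stands for the output word of one of the three copies.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise best-two-out-of-three vote over three redundant words."""
    return (a & b) | (b & c) | (a & c)


good = 0b1010
upset = 0b0110  # one copy with two flipped bits

# A single faulty copy is outvoted by the two good copies.
voted_single_fault = majority_vote(good, good, upset)

# Two copies with the same wrong bit outvote the good copy.
voted_double_fault = majority_vote(good, 0b1000, 0b1000)
```

The second call shows the limit of the scheme: the voter masks one faulty copy, not two copies that agree on a wrong value.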
TMR Implementation

• As previously illustrated, TMR can be implemented in a variety of ways.

• The definition of TMR depends on what portion of the circuit is triplicated and where the voters are placed.

• The strongest TMR implementation will triplicate all data-paths and contain separate voters for each data-path.
— However, this can be costly: area, power, and complexity.
— Hence a trade is performed to determine the TMR scheme that requires the least amount of effort and circuitry that will meet project requirements.

• Presentation scope: Block TMR (BTMR), Localized TMR (LTMR), Distributed TMR (DTMR), and Global TMR (GTMR).
Block Triple Modular Redundancy: BTMR

BTMR can only mask errors. Triplication with no correction/flushing gives 3x the error rate.

• Need feedback to correct.
• Cannot apply internal correction from voted outputs.
• If blocks are not regularly flushed (e.g., reset), errors can accumulate, and BTMR may not be an effective technique.
Examples of Flushable BTMR Designs

• Shift registers.
• Transmission channels: it is typical for transmission channels to send and reset after every sent packet.
• Systems that can be reset (or power-cycled) every so often.

[Figure: transmission channel example.]

If The System Is Not Flushable, Then BTMR May Not Provide The Expected Level of Mitigation

• BTMR can work well as a mitigation scheme if the expected MTTF for each module is much greater than the expected time-window of correct operation.

• But... if the expected time to failure for one block is close to the expected time-window of correct operation, then BTMR doesn't buy you anything.

• If not thought out well, BTMR can actually be a detriment: complexity, power, and area, and a false sense of redundancy.
Explanation of BTMR Strength and Weakness using Classical Reliability Models

Simplex system: $R_{simplex}(t) = e^{-\lambda t}$, $MTTF_{simplex} = 1/\lambda$.
BTMR system: $R_{BTMR}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$, $MTTF_{BTMR} = 5/(6\lambda) = 0.833/\lambda$.

[Figure: reliability versus time (0 to 10000 minutes) for a simplex system (no TMR) and its BTMR version.]

Operating a BTMR design in the highlighted early time interval will provide an increase in reliability. However, over time, BTMR reliability drops off faster than a system with no TMR.
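
The classical model behind this comparison can be evaluated numerically. The sketch below assumes a constant failure rate, a perfect voter, and no flushing or repair. The value of `lam` is a hypothetical rate chosen only for illustration. Setting $3R^2 - 2R^3 = R$ gives $R = 1/2$, so the two curves cross at $t = \ln(2)/\lambda$.

```python
import math


def r_simplex(lam: float, t: float) -> float:
    """Reliability of one unmitigated block with constant failure rate lam."""
    return math.exp(-lam * t)


def r_btmr(lam: float, t: float) -> float:
    """Reliability of a 2-out-of-3 block TMR with a perfect voter and no flushing."""
    r = r_simplex(lam, t)
    return 3 * r**2 - 2 * r**3


lam = 1e-3  # hypothetical failures per minute

crossover_t = math.log(2) / lam  # time at which both reliabilities equal 0.5
mttf_simplex = 1 / lam
mttf_btmr = 5 / (6 * lam)

early_gain = r_btmr(lam, 0.1 / lam) - r_simplex(lam, 0.1 / lam)  # positive
late_loss = r_btmr(lam, 2.0 / lam) - r_simplex(lam, 2.0 / lam)   # negative
```

Before `crossover_t` the BTMR curve sits above the simplex curve; after it, the BTMR curve sits below, which is why the BTMR MTTF is shorter than the simplex MTTF.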




BTMR Bottom Line

• How long does your BTMR system need to operate relative to the MTTF for one of its unmitigated blocks?

• Over time, a BTMR system has lower reliability than an unmitigated system.

• Adding more replicated blocks (e.g., an N-out-of-M system) will only increase the reliability during the short window near start time. However, over time, the reliability of an N-out-of-M system will fall faster as M (the number of replicated blocks) grows.
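
The N-out-of-M claim can be checked with the standard binomial reliability formula, assuming independent, identical blocks and a perfect voter. The function name and the sample block reliabilities (0.9 for "early" and 0.3 for "late") are hypothetical values for illustration.

```python
from math import comb


def n_out_of_m(n: int, m: int, r: float) -> float:
    """Probability that at least n of m independent blocks still work,
    where r is the reliability of a single block."""
    return sum(comb(m, i) * r**i * (1 - r) ** (m - i) for i in range(n, m + 1))


# Early in the mission each block is still very likely good (r = 0.9).
early_tmr = n_out_of_m(2, 3, 0.9)
early_5mr = n_out_of_m(3, 5, 0.9)

# Late in the mission each block has probably failed (r = 0.3).
late_tmr = n_out_of_m(2, 3, 0.3)
late_5mr = n_out_of_m(3, 5, 0.3)
```

With these numbers the 3-out-of-5 system is the most reliable early on and the least reliable late, with the single block ahead of both redundant systems at r = 0.3.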
What Should be Done If Availability Needs to be Increased?

• If the blocks within the BTMR have a relatively high upset rate with respect to the availability window, then stronger mitigation must be implemented.

• Bring the voting/correcting inside of the modules... bring the voting to the module DFFs.

The following slides illustrate the various forms of TMR that include voter insertion in the data-path.

TMR Nomenclature | Description | TMR Acronym
Local TMR | DFFs are triplicated | LTMR
Distributed TMR | DFFs and CL-data-paths are triplicated | DTMR
Global TMR | DFFs, CL-data-paths and global routes are triplicated | GTMR or XTMR
Describing Mitigation Effectiveness Using A Model

DFF: Edge triggered flip-flop. CL: Combinatorial logic.

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

The functional logic term is composed of:

$P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$

— $P(f_s)_{DFFSEU \rightarrow SEU}$: probability that an SEU in a DFF will manifest as an error in the next system clock cycle.
— $P(f_s)_{SET \rightarrow SEU}$: probability that an SET in the CL will manifest as an error in the next system clock cycle.
Local Triple Modular Redundancy (LTMR)

• Only DFFs are triplicated. Data-paths are kept singular.

• LTMR masks upsets from DFFs and corrects DFF upsets if feedback is used.

• Good for devices where DFFs are most susceptible and configuration and CL susceptibility is insignificant; e.g., Microsemi ProASIC3.

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

$P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$
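
The masking and feedback correction described for LTMR can be sketched behaviorally. This is a simplified model, not the cell implementation of any device: `LtmrFlipFlop`, its `enable` input, and the `upset` method are hypothetical. Three copies of one flip-flop are voted, and when the flip-flop holds its value the voted output is fed back into all three copies.

```python
def vote(a: int, b: int, c: int) -> int:
    """Single-bit majority vote."""
    return (a & b) | (b & c) | (a & c)


class LtmrFlipFlop:
    """Behavioral model of one triplicated DFF with a voted feedback path."""

    def __init__(self, value: int = 0) -> None:
        self.copies = [value, value, value]

    def output(self) -> int:
        return vote(*self.copies)

    def upset(self, index: int) -> None:
        """Model an SEU by flipping one of the three copies."""
        self.copies[index] ^= 1

    def clock(self, enable: bool, d: int) -> None:
        # When holding (enable low), the voted output is fed back, which
        # rewrites an upset copy with the majority value.
        next_value = d if enable else self.output()
        self.copies = [next_value] * 3


ff = LtmrFlipFlop(value=1)
ff.upset(0)
masked_output = ff.output()         # the upset copy is outvoted
ff.clock(enable=False, d=0)         # hold cycle with voted feedback
copies_after_feedback = ff.copies   # the upset copy has been rewritten
```

Because the data input `d` is singular in LTMR, a transient arriving on `d` during an enabled clock would be loaded into all three copies, which is the CL weakness noted for this scheme.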
Windowed Shift Registers (WSRs): NEPP Test Structure

[Figure: shift register chain of DFFs (D flip-flops) with combinatorial logic (inverters) between stages and a 4-bit window output.]

N levels of inverters between DFF stages: N = 0, 8, and 18.




Adding LTMR to a Microsemi ProASIC3 Device versus RTAXs Embedded LTMR

LET: linear energy transfer.
WSR: test circuit, windowed shift register.
INV: inverters between WSR stages.

• At lower LETs, applying LTMR to a ProASIC3 design has a similar (a little higher) SEU response to the Microsemi RTAXs series.
• At higher LETs, clock tree upsets start to dominate and LTMR in the ProASIC3 is not as effective.
• Depending on your target radiation environment, for most critical applications, the ProASIC3 SEU responses will produce acceptable upset rates.

[Figure: SEU response versus LET (MeV*cm2/mg). Top panel: ProASIC3 LTMR WSR at 100 MHz, checkerboard pattern. Bottom panel: RTAX4000D/RTAX2000 shift registers at 80 MHz, checkerboard pattern, with series RTAX4000D INV=8, RTAX4000D INV=0, and RTAX2000 INV=8.]
LTMR Should Not Be Used in An SRAM Based FPGA

[Figure: look-up tables (LUTs).]

Proven via NEPP experiments: SEU data for LTMR implemented in Xilinx FPGA devices are similar to or worse than data with no added mitigation.
Distributed Triple Modular Redundancy (DTMR)
• Triplicate all data-paths and add voters after DFFs.

• DTMR masks upsets from configuration + DFFs + CL and corrects captured upsets if feedback is used.

• Good for devices where configuration or DFFs + CL are more susceptible than project requirements; e.g.,


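$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

With DTMR: $P_{configuration}$ is low, $P(f_s)_{functionalLogic}$ is low, and $P_{SEFI}$ is minimally lowered.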


Global Triple Modular Redundancy (GTMR)
• Triplicate all clocks and data-paths, and add voters after DFFs.

• GTMR has the same level of protection as DTMR; however, it also protects clock domains.

• Good for devices where configuration or DFFs + CL are more susceptible than project requirements; e.g., Xilinx and Altera commercial FPGAs.

[Figure: triplicated DFFs and logic.]

$P(f_s)_{error} \propto P_{configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

$P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$

Theoretically, GTMR Is the Strongest
Mitigation Strategy... BUT...

• Triplicating a design and its global routes takes up a
lot of power and area.

• Generally performed after synthesis by a tool; not
part of the RTL.

• Skew between clock domains must be minimized such
that it is less than the shortest routing delay from DFF
to DFF (otherwise a hold time violation or race condition can occur):

  - Does the FPGA contain enough low-skew clock
  trees? (Each clock + its synchronized reset) x 3.

  - Limit skew of clocks coming into the FPGA.

  - Limit skew of clocks from their input pin to their
  clock tree.

• Difficult to verify.
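The clock-skew rule above can be written as a simple check. This is a minimal sketch under stated assumptions: the function names and all numbers are hypothetical, and the comparison uses only the condition given on the slide (skew less than the shortest DFF-to-DFF routing delay). A real static timing analysis would also account for clock-to-Q delay and DFF hold time.

```python
def worst_case_skew(arrival_times_ns):
    """Largest difference in clock arrival time among the clock copies."""
    return max(arrival_times_ns) - min(arrival_times_ns)


def skew_is_safe(arrival_times_ns, dff_to_dff_delays_ns):
    """Apply the rule: skew must be less than the shortest DFF-to-DFF delay."""
    return worst_case_skew(arrival_times_ns) < min(dff_to_dff_delays_ns)


# Hypothetical numbers, in nanoseconds, for three triplicated clock trees.
clock_arrivals = [1.00, 1.15, 1.05]
routing_delays = [0.40, 0.90, 1.30]

# Skew of about 0.15 ns is below the 0.40 ns shortest path: acceptable.
safe = skew_is_safe(clock_arrivals, routing_delays)

# Skew of 0.50 ns exceeds the 0.40 ns shortest path: race condition risk.
unsafe = skew_is_safe([1.00, 1.50, 1.05], routing_delays)
```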

Xilinx Kintex UltraScale Mitigation Study

[Figure: log-scale plot (1.00E+03 to 1.00E+08) versus LET (MeV*cm2/mg), 0 to 6. Series: No TMR, DTMR Partition, DTMR no Partition, LTMR. Annotations: "First observed DTMR Partition failure"; "LTMR was not tested at this LET".]

TMR and Verification


If a system is required to be protected using triple
modular redundancy (TMR), improper insertion
can jeopardize the reliability and security of the
system.


Due to the complexity of the verification process
and the complexity of digital designs, there are
currently no available techniques that can
provide complete and reliable confirmation of
TMR insertion.


Can you trust that TMR has been inserted as
expected (correct topological scheme) and has
not broken existing logic during the insertion
process?


We are working on it!

Currently, What Are the Biggest Challenges
Regarding Mitigation Insertion?


• Tool availability... Synopsys is not quite ready for DTMR or GTMR.


• Users are not selecting the correct mitigation scheme for their
target FPGA.


• Mitigation is too complex to fully verify.


Antifuse + LTMR: Microsemi
RTAX or RTSX family


Commercial SRAM: Xilinx
and Altera devices

User versus Embedded Mitigation


• A subset of user-inserted mitigation strategies
has been presented.


• None of the strategies is 100% fail-safe.


• Depending on the project requirements and the
target device’s SEU susceptibility, the most
efficient mitigation strategy should be selected.


• In most cases, devices with embedded
mitigation do not require additional (user-inserted)
mitigation.

Beware of unhardened global routes. They do cause
system upsets.

V5QV (SIRF) and ProASIC3 have radiation-hardened
configuration but do not have hardened global routes.

Synchronous FSMs and SEUs


• A synchronous FSM is designed to deterministically
transition through a pattern of defined states.

• A synchronous FSM utilizes DFFs to hold its current
state, transitions to a next state controlled by a clock
edge and combinatorial logic, and only accepts
inputs that have been synchronized to the same
clock.

[Diagram: synchronous FSM block with clock and inputs.]

• FSM SEUs can occur from:
  - Caught data-path SETs
  - DFF SEUs
  - Clock/Reset SETs


To be presented by Melanie D. Berg at the Single Event Effects (SEE) Symposium and Military and Aerospace Programmable Logic Devices (MAPLD) Workshop, La Jolla, CA, May 22-25, 2017.

5-State FSM Binary Encoding Example


[Diagram, left: 5-State Finite State-Machine (State 0 through State 4). Caption: an FSM used to control a peripheral device.]

[Diagram, right: 5-State Finite State-Machine, Binary Encoding. Caption: 5-State FSM with each state encoded as a binary number.]


An SEU can change the current state and cause a
catastrophic event.

How Do We Implement Fail-Safe FSMs?


• Question: A designer states that all FSMs
have been implemented as “safe”. What do
you expect?

• Correction? Detection? Masking?
  - What does correction mean?

  - All mitigation shall be defined unambiguously
  by the requirements and by the designer.

Safe State Machines


• As currently defined by design tools and by some
designers, the term “safe” state machine is a misnomer.


• Auto-transitioning (“safe state-machine”) is a reaction to
a small subset of incorrect transitions (those into unmapped states).
It does not correct or mask (protect) against incorrect
transitioning.


5-State Finite State-Machine, Binary Encoding

Encoding   Mapped?
000        Yes
001        Yes
010        Yes
011        Yes
100        Yes
101        No
110        No
111        No

What happens if an SEU causes a transition from “001” to “101”?

Safe State Machines: What Happens if an
SEU Causes a Transition from “001” to
“101”?


• As currently implemented, a “safe” state machine will
automatically transition to a reset (or “safe”) state.


• Problem: this could be detrimental to your system.

5-State Finite State-Machine, Binary Encoding

Encoding   Mapped?
000        Yes
001        Yes
010        Yes
011        Yes
100        Yes
101        No
110        No
111        No

Problems with Current “Safe” FSM Definition


• Sounds safer than what it really is.


• Does not do anything for
incorrect transitions into
mapped states.


• Does not correct the state:

  - Something that is supposed to
  be on will abruptly shut off.

  - Other FSMs or control logic
  can become unsynchronized
  with the bad FSM, with or
  without the automated jump to
  a “safe” state.

[Diagram: 5-State Finite State-Machine.]
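
The limitation described above can be counted directly. The sketch below is illustrative: it assumes the 3-bit binary encoding of the 5-state example (000 through 100 mapped; 101, 110, 111 unmapped), and the function names are hypothetical. It enumerates every single-DFF upset and separates those that land in an unmapped state (where auto-transitioning reacts) from those that land in another mapped state (where nothing happens).

```python
MAPPED = {0b000, 0b001, 0b010, 0b011, 0b100}   # the five defined states
WIDTH = 3                                      # three state DFFs


def single_bit_upsets(state):
    """All states reachable from `state` by flipping exactly one DFF."""
    return [state ^ (1 << bit) for bit in range(WIDTH)]


def classify_upsets():
    """Split single-bit upsets into unmapped (caught) and mapped (missed)."""
    caught, missed = [], []
    for state in sorted(MAPPED):
        for upset in single_bit_upsets(state):
            pair = (format(state, "03b"), format(upset, "03b"))
            (missed if upset in MAPPED else caught).append(pair)
    return caught, missed


caught, missed = classify_upsets()
total = len(caught) + len(missed)
print(f"auto-transition reacts to {len(caught)} of {total} upsets")
# prints "auto-transition reacts to 5 of 15 upsets"
```

For this encoding, two thirds of the single-bit upsets move the FSM into a valid but wrong state, which a "safe" implementation neither detects nor corrects.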

Can Auto-transitioning Work for Your Mission?

• Auto-transitioning can work if
incorrect sequencing of your FSM
will not cause system failure; e.g.,
mathematical logic control.


• Auto-transitioning can be
acceptable if it is used in
conjunction with a detection flag.
The detection flag must propagate
to all necessary logic.


• But remember, there is no
protection or detection with
auto-transitioning when incorrectly
transitioning to a mapped state.

Auto-transitioning + detection is available with computer
aided design (CAD) tools.

[Diagram: 5-State Finite State-Machine.]
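
A small behavioural model makes the role of the detection flag concrete. This is a sketch under assumptions, not the output of any CAD tool: the transition order (0 to 1 to 2 to 3 to 4 and back to 0), the choice of state 0 as the "safe" state, and the name `next_state` are all hypothetical.

```python
MAPPED = {0, 1, 2, 3, 4}   # binary encodings 000 through 100
SAFE_STATE = 0             # assumed reset/"safe" state


def next_state(current):
    """Return (next_state, error_flag) for one clock edge.

    Assumed behaviour: mapped states advance in a fixed cycle
    0 -> 1 -> 2 -> 3 -> 4 -> 0; unmapped states (5, 6, 7) force the
    FSM to SAFE_STATE and raise the detection flag.
    """
    if current not in MAPPED:
        return SAFE_STATE, True
    return (current + 1) % 5, False


# SEU flips the MSB: "001" -> "101" (unmapped): flagged, FSM resets.
detected = next_state(0b001 ^ 0b100)

# SEU flips the middle bit: "001" -> "011" (mapped): no flag, and the
# FSM silently continues from the wrong state.
undetected = next_state(0b001 ^ 0b010)
```

The flag only helps if the rest of the design consumes it; the second case shows why the flag alone is not protection.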

Implementing Corrective Logic for FSMs

• ASICs or FPGAs with hardened configuration:
  - LTMR: Triplicate each DFF and use a majority voter.
    • The triplication + voter is treated as one DFF.
    • Encoding doesn’t change.
    • Resultant FSM has 3 times the number of DFFs
    of the original encoding scheme.
    • Combinatorial logic (not including the voters)
    does not change.
  - Hamming Code-3: requires a new encoding scheme.


• FPGAs with commercial SRAM configuration:
  - DTMR is suggested.

There are computer aided design (CAD) tools that can
assist in adding all of the above mitigation strategies.


Be careful regarding unhardened (per SEU) global routes.

FSM Fault Tolerance:
5-State Conversion to a Hamming Code-3 FSM


[Diagram, left: 5-State Binary Finite State-Machine converted to a Hamming-3 FSM. Caption: Hamming Code-3 FSM diagram for a 5 base-state FSM: would need 5*7=35 FSM states to be represented... 6 DFFs.]

[Diagram, right: a closer look at a base-state: State 0 (State IDLE) and its Hamming-3 companion states. With no SEU the FSM remains in the base state; companion states shown include 001000 and 000001.]
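
The 5*7=35 state count can be reproduced with a small script. The sketch below assumes one particular distance-3 encoding (3 data bits plus 3 parity bits); it is not necessarily the encoding used in the slides or produced by any CAD tool, and all function names are hypothetical. Each base state gets a 6-bit codeword, its six single-bit-flip neighbours are its companion states, and any companion is corrected back to its base state.

```python
from itertools import combinations


def encode(data):
    """Map a 3-bit value to a 6-bit codeword: 3 data bits + 3 parity bits."""
    d2, d1, d0 = (data >> 2) & 1, (data >> 1) & 1, data & 1
    p2, p1, p0 = d2 ^ d1, d2 ^ d0, d1 ^ d0
    return (data << 3) | (p2 << 2) | (p1 << 1) | p0


def distance(a, b):
    """Hamming distance between two codewords."""
    return bin(a ^ b).count("1")


BASE = {state: encode(state) for state in range(5)}   # 5 base states


def companions(codeword):
    """The 6 states that are one bit-flip away from a base state."""
    return [codeword ^ (1 << bit) for bit in range(6)]


def correct(word):
    """Return the base state whose codeword is within distance 1, else None."""
    for state, codeword in BASE.items():
        if distance(word, codeword) <= 1:
            return state
    return None


min_distance = min(distance(a, b) for a, b in combinations(BASE.values(), 2))
all_states = {w for cw in BASE.values() for w in [cw] + companions(cw)}
```

Because the minimum distance between base codewords is 3, the seven-state groups never overlap, which is why a single upset can always be attributed to exactly one base state.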

Some Thoughts

Concerns and Challenges of Today
and Tomorrow for Mitigation Insertion


• User insertion of mitigation strategies in most FPGA and ASIC
devices has proven to be a challenging task because of reliability,
performance, area, and power constraints:
  - It is difficult to synchronize across triplicated systems.
  - Mitigation insertion slows down the system.
  - A triplicated version of a design may not fit into one device.
  - Power and thermal hot-spots are increased.
• The newer devices have a significant increase in gate count and
lower power. This helps to accommodate area and power
constraints while triplicating a design. However, this increases the
challenge of module synchronization.
• Embedded mitigation has helped in the design process. However, it
is proving to be an ever-increasing challenge for manufacturers.
  - We (users) want embedded systems: cheaper, faster, and less power
  hungry.
  - However, heritage has proven that for critical applications, embedded
  systems have provided excellent performance and reliability.

Summary


• For critical applications, mitigation may be required.


• Determine the correct mitigation scheme for your mission while
incorporating given requirements:

  - Understand the susceptibility of the target FPGA and
  potential necessity of other devices.

  - Investigate whether the selected mitigation strategy is compatible with
  the target FPGA device.

  - Calculate the reliability of the mitigation strategy to determine
  if the final system will satisfy requirements.

  - Ask the right questions regarding functional expectation,
  mitigation, requirement satisfaction, and verification of
  expectations.


• Although it is desirable from a user’s perspective to have
embedded mitigation, cost seems to be driving the market
towards unmitigated commercial FPGA devices. Hence, it will be
necessary for users to familiarize themselves with optimal
mitigation insertion and usage.




Acknowledgements


• Some of this work has been sponsored by the
NASA Electronic Parts and Packaging (NEPP)
Program and the Defense Threat Reduction
Agency (DTRA).

• Thanks are given to the NASA Goddard Radiation
Effects and Analysis Group (REAG) for their
technical assistance and support. REAG is led by
Kenneth LaBel and Jonathan Pellish.


Contact Information: 


Melanie Berg: NASA Goddard REAG FPGA 
Principal Investigator: 


[E-mail address illegible in source.]




